Load the necessary libraries
library(mgcv) #for GAMs
library(gratia) #for GAM plots
library(broom) #for tidy output#
library(emmeans) #for marginal means etc
library(MuMIn) #for model selection and AICc
library(tidyverse) #for data wrangling
library(DHARMa) #for simulated residuals
library(performance) #for residual disagnostics
library(see) # to visualize residual diagnostics
theme_set(theme_classic())
In a chapter on time series analysis, Reed et al. (2007) presented Hawaiian longitudinal waterbird survey data. These data comprise winter counts of various species of stilts, coots and moorehen along with year and the previous seasons rainfall. Here, we will explore the temporal patterns in the Kauai Moorhen.
Moorhen
Format of reed.csv data file
| year | stilt_oahu | stilt_maui | coot_oahu | coot_maui | moorhen rainfa | ll |
|---|---|---|---|---|---|---|
| 1956 | 163 | 169 | 528 | 177 | 2 | 15.16 |
| 1957 | 272 | 190 | 338 | 273 | NA | 15.48 |
| 1958 | 549 | 159 | 449 | 256 | 2 | 16.26 |
| 1959 | 533 | 211 | 822 | 170 | 10 | 21.25 |
| 1960 | NA | 232 | NA | 188 | 4 | 10.94 |
| 1961 | 134 | 155 | 717 | 149 | 10 1 | 9.93 |
| year | - a continuous predictor |
| stilt_oahu | - the abundance of the Oahu stilt |
| stilt_maui | - the abundance of the Maui stilt |
| coot_oahu | - the abundance of the Oahu coot |
| coot_maui | - the abundance of the Maui coot |
| moorhen | - the abundance of the Kauai moorhen |
| rainfall | - the number of centimeters (or inches) of rain |
reed_full <- read_csv('../data/reed.csv', trim_ws=TRUE) %>%
janitor::clean_names() %>%
rename(moorhen = moorhen_kauai)
## Parsed with column specification:
## cols(
## Year = col_double(),
## Stilt.Oahu = col_double(),
## Stilt.Maui = col_double(),
## Coot.Oahu = col_double(),
## Coot.Maui = col_double(),
## Moorhen.Kauai = col_double(),
## Rainfall = col_double()
## )
reed <- filter(reed_full, complete.cases(moorhen))
glimpse(reed)
## Rows: 45
## Columns: 7
## $ year <dbl> 1956, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1966, 1967…
## $ stilt_oahu <dbl> 163, 549, 533, NA, 134, 175, 356, 485, 242, 209, 175, 162,…
## $ stilt_maui <dbl> 169, 159, 211, 232, 155, 282, 170, 164, 253, 188, 226, 171…
## $ coot_oahu <dbl> 528, 449, 822, NA, 717, 12, 169, 98, 77, 106, 64, 15, 130,…
## $ coot_maui <dbl> 177, 256, 170, 188, 149, 205, 108, 79, 75, 80, 104, 122, 7…
## $ moorhen <dbl> 2, 2, 10, 4, 10, 12, 10, 8, 17, 7, 44, 50, 26, 10, 7, 1, 3…
## $ rainfall <dbl> 15.16, 16.26, 21.25, 10.94, 19.93, 12.63, 20.09, 10.02, 12…
ggplot(reed, aes(moorhen, x = year)) +
geom_point() +
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cr"),
method.args = list(family = 'poisson'))
We can clearly see that it is a non-linear trend that is increasing through time, which would be difficult to model using any polynomials or worse, linear terms alone. Also note that the values are all 0+ integers, so a poisson or negative binomial family of distirbution would work well for these data!
ggplot(reed, aes(moorhen, x = rainfall)) +
geom_point() +
geom_smooth(method = 'gam', formula = y ~ s(x, bs = "cr"),
method.args = list(family = 'poisson'))
Model formula: \[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ log(\lambda_i) =\beta_0 + f(year_i) + f(rainfall_i) \]
where \(\beta_0\) is the y-intercept. \(f(year)\) and \(f(rainfall)\) indicate the additive smoothing functions of year and rainfall respectively.
We don’t need the thin-plate to be efficient because we don’t have a ton of data (as in, thousands of data points), thus we can use cubic regression.
reed_gam1 <- gam(moorhen ~ s(year, bs = "cr") + s(rainfall, bs = "cr"),
data = reed, family = poisson(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam1)
## k' edf k-index p-value
## s(year) 9 8.581571 0.7744578 0.0350
## s(rainfall) 9 6.394206 1.3160101 0.9775
The k-check is saying that the rainfall is ok (edf below max k, i.e. k’), but the # of knots for the year smoother is potentially not enough (k-index less than 1!!), so we can double it to make sure that it is the same.
appraise(reed_gam1)
Clearly a big problem with the QQ plot, suggesting overdispersion! This will show up in the DHARMa residuals. Other plots look ok overall.
s <- simulateResiduals(reed_gam1, plot=T)
testDispersion(s) # very overdispersed
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 6.0372, p-value < 2.2e-16
## alternative hypothesis: two.sided
reed_gam2 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + s(rainfall, bs = "cr"),
data = reed, family = poisson(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam2)
## k' edf k-index p-value
## s(year) 18 1.727006e+01 1.440891 0.995
## s(rainfall) 9 4.521326e-04 1.121432 0.745
k-index is not less than 1 now, so looks good.
DHARMa residuals still likely to look like shite:
appraise(reed_gam2)
s <- simulateResiduals(reed_gam2, plot=T)
Still looks bad…
testZeroInflation(s)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0, p-value = 1
## alternative hypothesis: two.sided
No evidence of zero-inflation.
testDispersion(s)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.7367, p-value = 0.072
## alternative hypothesis: two.sided
Overdispersed, but technically not significantly so. Still, the QQ plot is terrible, so should probably address!!
reed_gam3 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + s(rainfall, bs = "cr"),
data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
k.check(reed_gam3)
## k' edf k-index p-value
## s(year) 18 7.8195821 0.6980527 0.0175
## s(rainfall) 9 0.3391737 1.1980039 0.9350
Problem with this fit is that DHARMa doesn’t work with this fit!
simulateResiduals(reed_gam3)
So to use it, we need to manually extract theta, the dispersion parameter.
(theta <- reed_gam3$family$getTheta(TRUE))
## [1] 4.380278
There is more than 4x the amount of expected dispersion! Next, we can re-fit the model with theta, allowing DHARMa to work with our model.
reed_gam3 <- reed_gam3 %>% update(.~., family = negbin(link = "log", theta = theta))
k.check(reed_gam3)
## k' edf k-index p-value
## s(year) 18 7.8195757 0.6980527 0.0325
## s(rainfall) 9 0.3392386 1.1980041 0.9350
K-index is low, but notice that k’ and edf are distingly different, with a lower edf, thus we do NOT need to add knots! Remember, it is a combination of all three things that determines if we need to add knots or not.
s <- simulateResiduals(reed_gam3, plot=T)
testDispersion(s)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.41874, p-value = 0.064
## alternative hypothesis: two.sided
appraise(reed_gam3)
concurvity(reed_gam3)
## para s(year) s(rainfall)
## worst 2.85743e-30 0.7865911 0.7865911
## observed 2.85743e-30 0.2683207 0.4674018
## estimate 2.85743e-30 0.2267308 0.4248958
All looks good! QQ plot looks a bit weird, but not too too bad.
draw(reed_gam3, resid = T)
# # Using base R graphics:
# plot(reed_gam3,
# pages=1, shift=coef(reed_gam3)[1],
# trans=exp,
# resid=TRUE, cex=4, scale=0)
summary(reed_gam3)
##
## Family: Negative Binomial(4.38)
## Link function: log
##
## Formula:
## moorhen ~ s(year, bs = "cr", k = 9 * 2 + 1) + s(rainfall, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.81023 0.07793 48.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(year) 7.8196 18 176.829 <2e-16 ***
## s(rainfall) 0.3392 9 0.416 0.247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.813 Deviance explained = 81.7%
## -REML = 217.41 Scale est. = 1 n = 45
Evidence of a wiggly effect of year, no such evidence for rainfall, thus we may consider shifting rainfall to a linear term alone.
reed_gam4 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1) + rainfall,
data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
reed_gam4 <- reed_gam4 %>% update(.~., family = negbin(link = "log", theta = theta))
summary(reed_gam4)
##
## Family: Negative Binomial(4.38)
## Link function: log
##
## Formula:
## moorhen ~ s(year, bs = "cr", k = 9 * 2 + 1) + rainfall
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.7986400 0.2183542 17.397 <2e-16 ***
## rainfall 0.0006351 0.0111731 0.057 0.955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(year) 7.912 18 178 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.808 Deviance explained = 81.7%
## -REML = 220.99 Scale est. = 1 n = 45
reed_gam5 <- gam(moorhen ~ s(year, bs = "cr", k = 9*2+1),
data = reed, family = nb(link = "log"), method = "REML", select = TRUE)
reed_gam5 <- reed_gam5 %>% update(.~., family = negbin(link = "log", theta = theta))
AICc(reed_gam3, reed_gam4, reed_gam5) %>% arrange(AICc)
So definitely no support for rainfall having much of any effect!
Conclusions:
tidy(reed_gam3)
newdata <- with(reed, list(year = modelr::seq_range(year, n=100),
rainfall = mean(rainfall, na.rm=T))) %>%
emmeans(reed_gam3, ~year, at = ., type = "response") %>%
as.data.frame %>%
mutate(moorhen = response, lwr = lower.CL, upr = upper.CL)
p <- ggplot(newdata, aes(y = moorhen, x = year)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.3) +
geom_line()
p + geom_point(data = reed)
These data are not strictly correct!, as they are after standardizing for rainfall (another variable in our model). Thus, when we have multiple values, we need to get the marginal plots.
We need to add back the residuals to the predicted to get the ‘observed’ values, after accounting for the effect of rainfall.
reed_obs <- with(reed_gam3$model,
data.frame(year = year, rainfall = mean(rainfall))) %>%
mutate(pred = predict(reed_gam3, newdata = ., type = "link"),
resid = reed_gam3$residuals,
moorhen = exp(pred + resid))
p + geom_point(data = reed_obs) +
geom_point(data = reed, col = "red")
Note the difference! Note that this is not the same as model shrinkage, even though the plot looks very similar to that in Statistical Rethinking.